INFO256 Project Report: Implementation and Evaluation of Xtract in WordSeer

Author

  • Mosharaf Chowdhury
Abstract

Natural languages are full of collocations: groups of words that frequently co-occur and reflect arbitrary word usage. They appear in both technical and non-technical corpora and often carry specific significance in individual contexts. Accurately retrieving and identifying collocations from a given corpus in an unsupervised manner is essential to understanding, and automatically generating, text related to that corpus. Identifying collocations, however, is not an easy problem. They vary widely in length, in the words involved, and in the relations between those words. Moreover, the words in a collocation can be adjacent or separated by unrelated words. One straightforward way to retrieve collocations is to find the most frequent N-grams in a corpus. This approach is attractive because many collocations are simply adjacent words that frequently appear together. WordSeer [1], a text analysis environment for literary corpora from UC Berkeley, uses a variation of the N-gram approach to identify collocations. Consequently, it shares the shortcomings of any N-gram-based approach: it cannot identify collocations with non-adjacent words, it has very high recall (and low precision), and it provides no functional information about the collocations it identifies. Xtract [2] is a statistical tool (i.e., a collection of algorithms) developed for identifying collocations with high precision and for statistically justifying their significance. It extends the N-gram approach with statistical filters for better effectiveness. In this project, we implemented the Xtract toolkit in WordSeer and compared its performance with WordSeer's default N-gram-based collocation retrieval mechanism on two large text corpora: Shakespeare and Abstracts. The former is a literary collection of writings by William Shakespeare, and the latter is a technical corpus of abstracts collected from top HCI conferences. The key findings are the following: 1. Xtract is more precise than the N-gram approach. 2. In a show-of-hands poll over a group of 30 experts, Xtract outputs were preferred three out of four times. However, in all cases, a sizable fraction of the participants were undecided.
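
The frequent N-gram baseline described in the abstract can be illustrated with a minimal Python sketch. It is not WordSeer's implementation (the abstract only says WordSeer uses a variation of the approach); the function name `top_ngrams`, the simple regular-expression tokenizer, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-grams of adjacent tokens in text.

    Hypothetical helper: lowercases the text and splits it with a
    naive regex tokenizer, which is an assumption of this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n adjacent tokens over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)


sample = "to be or not to be that is the question to be is to do"
print(top_ngrams(sample, n=2, k=3))
```

Because the window only covers adjacent tokens, word pairs separated by unrelated words never appear among the candidates. This is the limitation the abstract attributes to N-gram-based retrieval.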

Publication date: 2013